Sample covariance between variables \(x=x_t\) and \(y=y_t\):
\[ c_{xy} = \frac{1}{N} \sum_i^N (x_i-\overline{x}) (y_i-\overline{y}) \]
Sample cross-covariance function for positive values of lag between variables \(x_t\) and \(y_{t+k}\) (Chatfield, The Analysis of Time Series, 2004):
\[ c_{xy}(k) = \frac{1}{N} \sum_{t=1}^{N-k} (x_t-\overline{x})(y_{t+k}-\overline{y}) \]
Pearson’s correlation coefficient (sample correlation) is defined as the covariance of two variables divided by the product of their standard deviations (which are the square roots of their respective variances):
\[ r_{xy} = \frac{c_{xy}}{\sqrt{c_{xx}c_{yy}}} %= \frac{\sum (x_i-\overline{x})(y_i-\overline{y})}{\sqrt{ \sum (x_i-\overline{x})^2 \sum (y_i-\overline{y})^2 }} \]
The sample cross-correlation function: \[ r_{xy}(k) = \frac{c_{xy}(k)}{\sqrt{c_{xx}(0)c_{yy}(0)}} \]
\(c_{xx}\) and \(c_{yy}\) are the sample variances of \(x\) and \(y\), respectively.
\(c_{xx}(0)\) and \(c_{yy}(0)\) that are the sample variances of \(x_t\) and \(y_t\) respectively.
“For descriptive purposes, the relationship will be described as strong if \(|r| \geq .8\), moderate if \(.5 < |r| <.8\), and weak if \(|r| \leq .5\).” – Devore and Berk, Modern Mathematical Statistics with Applications, 2012
Anscombe’s quartet classically illustrates the pitfalls on relying on a single coefficient – always visualize your data. Consider the following four datasets:
All have similar statistical properties.
| set | mean of x | mean of y | variance of x | variance of y | correlation | intercept | slope |
|---|---|---|---|---|---|---|---|
| 1 | 9 | 7.5 | 11 | 4.127 | 0.82 | 3 | 0.5 |
| 2 | 9 | 7.5 | 11 | 4.128 | 0.82 | 3 | 0.5 |
| 3 | 9 | 7.5 | 11 | 4.123 | 0.82 | 3 | 0.5 |
| 4 | 9 | 7.5 | 11 | 4.123 | 0.82 | 3 | 0.5 |
Illustration of lag for 20.07.2013 in Lausanne:
This example is shown for illustrative purposes; such a relationship between radiation intensity and ozone concentration may be better captured by examining correlations between the daily maximum values of the two variables.
A correlation or lagged correlation in x and y may also be observed. For examples, a correlation between \(x\) and \(y\) may not be due to the causal relationship between \(x\) and \(y\), but dependent on a third variable, \(z\). This is written:
|
\[ \begin{aligned} y_t &= f_y(z_t)\\ x_t &= f_x(z_t) \end{aligned} \] |
\[ \begin{aligned} y_{t+k} &= f_y(z_t)\\ x_{t} &= f_x(z_t) \end{aligned} \] |
“Correlation does not imply causation”:
For entertainment purposes, you may wish to visit the website of Spurious correlations.
library(tidyverse)
library(chron)
source("functions_extra.R")
Sys.setlocale("LC_TIME","C")
options(stringsAsFactors=FALSE)
options(chron.year.abb=FALSE)
theme_set(theme_bw()) # just my preference for plots
Let us load data saved from Lesson 4.
df <- readRDS("data/2013/lau-zue.rds")
variables <- c("O3", "NO2", "CO", "PM10", "SO2", "NMVOC", "EC", "TEMP", "PREC", "RAD")
Before computing correlations, we will first visualize the relationships using scatter plots (or x-y plots).
Let us calculate the daily maximum values for each variable:
lf <- gather(df, variable, value, all_of(variables))
dm <- lf %>%
group_by(site, year, month, day, season, variable) %>%
summarize(value=max(value, na.rm=TRUE))
daily.max <- dcast(dm, site + year + month + day + season ~ variable)
rm(dm) # no longer needed
Recall the relationship between temperature and O3 shown in a previous lecture. Note the seasonal dependence of this relationship.
ggplot(daily.max)+
facet_grid(site~season)+
geom_point(aes(TEMP, O3))
The corresponding correlation values can be obtained with the built-in function, cor (see ?cor). Note use ofuse=“pairwise.complete.obs”` for calculation of correlation with missing values.
(cor.values <- daily.max %>% group_by(site, season) %>%
summarize(correlation=cor(TEMP, O3, use="pairwise.complete.obs")))
## # A tibble: 8 x 3
## # Groups: site [2]
## site season correlation
## <chr> <fct> <dbl>
## 1 LAU DJF 0.0760
## 2 LAU MAM 0.534
## 3 LAU JJA 0.802
## 4 LAU SON 0.527
## 5 ZUE DJF 0.180
## 6 ZUE MAM 0.630
## 7 ZUE JJA 0.843
## 8 ZUE SON 0.648
We can format this table and save to file:
(cor.values.wf <- spread(cor.values, site, correlation))
## # A tibble: 4 x 3
## season LAU ZUE
## <fct> <dbl> <dbl>
## 1 DJF 0.0760 0.180
## 2 MAM 0.534 0.630
## 3 JJA 0.802 0.843
## 4 SON 0.527 0.648
write.csv2(cor.values.wf, "correlations.csv", row.names=FALSE)
write.csv2 writes to a CSV (comma separated value) file using the European convention of defining delimiters as semicolons (;) rather than commas (,).
Or, we can visualize the values:
ggplot(cor.values)+
geom_col(aes(season, correlation))+
scale_y_continuous(limits=c(0,1))+
facet_grid(.~site)+
scale_y_continuous(expand=expansion(mult=c(0, 0.1)))
We can examine the correlation of other pollutants or meterological variables with O3.
gather(daily.max, variable, value, all_of(setdiff(variables, "O3"))) %>% # everything but ozone
ggplot+
facet_grid(site~variable, scale="free_x")+
geom_point(aes(value, O3, group=season, color=season), shape=4)
We will next look at a scatter plot between hourly measurements of CO and NO2. Why does this relationship exist?
ggplot(df)+
facet_grid(site~season)+
geom_point(aes(CO, NO2))
For the following scatterplot matrix, we will use a built-in library called lattice which is a plotting system that exists in parallel to R’s base graphics and ggplot2 package (you can check out ggpairs from the GGally package for a ggplot2-based solution). We additionally define a function to include correlation coefficients in our panels.
library(lattice)
CorrelationValue <- function(x, y, ...) {
i <- is.finite(x) & is.finite(y)
correlation <- cor(x[i], y[i])
if(is.finite(correlation)) {
cpl <- current.panel.limits()
panel.text(mean(cpl$xlim),mean(cpl$ylim),
bquote(italic(r)==.(sprintf("%.2f",correlation))),
adj=c(0.5,0.5),col="blue")
}
}
Plot daily maximum values as pairwise points (only Lausanne) using this syntax:
ix <- grepl("LAU", daily.max[["site"]], fixed=TRUE)
splom(~daily.max[ix,c("O3","NO2","CO","PM10","TEMP","PREC","RAD")] | daily.max[ix,"season"],
upper.panel = CorrelationValue,
pch=4)
Plot hourly data as pairwise points. Given the large number of points, we can “smooth” the visual representation.
ix <- grepl("LAU", df[["site"]], fixed=TRUE)
splom(~df[ix,c("O3","NO2","CO","PM10","TEMP","PREC","RAD")] | df[ix,"season"],
upper.panel = CorrelationValue,
panel = panel.smoothScatter,
pch=4)
Notice some values of correlation went up.
We can define a function to generate lagged pairs for \(k\) intervals and apply it using the do() function. If we provide hourly measurements, \(k=1\) corresponds to a lag of 1 hour, and so on.
Lag <- function(pair, k) {
out <- data.frame(lag=k, head(pair[,1],-k), tail(pair[,2],-k))
names(out)[2:3] <- colnames(pair)
out
}
lagged <- df %>%
group_by(site, season) %>%
do(rbind(Lag(.[,c("RAD","O3")], 1),
Lag(.[,c("RAD","O3")], 2),
Lag(.[,c("RAD","O3")], 3),
Lag(.[,c("RAD","O3")], 4),
Lag(.[,c("RAD","O3")], 5),
Lag(.[,c("RAD","O3")], 6)))
ggplot(lagged) +
geom_point(aes(RAD, O3, group=site, color=site), shape=4)+
facet_grid(lag~season)